Before we begin please note anything bolded is either a finding or an important comment
This blog illustrates the use of R to analyze COVID-19 Outbreak in Peru by applying different algorithms in an attempt to gauge the spread of the virus and success of public health containment efforts in reducing the speed and degree of spread.
For this analysis, I’m laboring Tim Churche’s blog which containings a detailed analysis of COVID-19 data for China.
This analysis is for informational purposes only and should not be relied upon for to inform governmental or health care policy.
Data utilized for this analysis comes from Coronavirus wiki, Peruvian Ministry of Health MINSA, Outbreak Dashboard and nCOV19 package. The use of these pages through web scraping can be challenging, however, because the format of the tables in which the data appears changes many times per day.
The biggest problem with these analyses is that they are not based on data tabulated by date of onset (of symptoms), but rather on data tabulated by date of reporting, or perhaps date of confirmation by lab test. We understand that lab results from molecular tests can take up to 7 days to be published while lab results from serological tests are not reliable due that test looks for antibodies within the blood stream which are usually developed from 7 to 30 days after contracting the virus.
If lab testing and reporting is swift and the lag is uniform, then the date of reporting or confirmation will lag consistently and only slightly behind the date of onset. Sadly that’s not usually the case, there are tipacally variable delays in lab confirmation and/or reporting, leading to backlogs of cases being reported all at once (known as a “data dump” by outbreak epidemiologists). This is very unfortunate, and it frustrates efforts to properly assess outbreak trajectories and effectiveness of intervention strategies. One solution would be for authorities to provide line-listings of all cases, including their date of reporting and their date of onset of symptoms, even if only presumptive dates are provided (and missing dates of onset can be imputed, in any case).
The effect of the unavoidable (because that’s all that is available) use of data tabulated by date of reporting here (and just about everywhere else) means that reproduction numbers and related measures of infectiousness may be inflated, particularly at first. Nonetheless, the methods illustrated here show the potential utility of epidemic trajectory analysis – but that utility would be so much greater if health authorities published data tabulated by date of onset, or better, line-listings that include date of onset. It would be very easy for them to do that.
This initial analysis is meant to visualize and forecast COVID-19 spread in Peru.
Peru COVID-19 dataset was updated on: 28Jun20
The analysis presented here primarily uses R packages developed and published by the R Epidemics Consortium (RECON).
The functions provided by the RECON incidence package are used to fit log-linear models to the growth and decay phases of the outbreak in Peru, and the decay phase model is used to very approximately predict when local transmission will be extinguished there.
I also utilize earlyR and EpiEstim packages, which are also published by RECON. In particular the get_R() function in earlyR calculates a maximum-likelihood estimate (from a distribution of likelihoods for the reproduction number) as well as calculating λ, which is a relative measure of the current “force of infection”:
$$ \lambda = \sum_{s=1}^{t-1} {y_{s} w (t - s)} $$
where w() is the probability mass function (PMF) of the serial interval, and ys is the incidence at time s.
The resulting λ “force of infection” plot indicates the daily effective infectiousness (subject to public health controls), with a projection of the diminution of the force of infection if no further cases are observed. The last date of observed data is indicated by the vertical blue dashed line. New cases are shown in a cumulative manner as black dots. It is a sign that the outbreak is being brought under control if λ, as indicated by the orange bars, is falling prior to or at the date of last observation (as indicated by the vertical blue line). Note that left of the vertical blue line the λ values are projections, valid only in no further cases are observed. As such, the plot is a bit confusing, but it is nonetheless useful if interpreted with this explanation in mind. The RECON packages are all open-source, and easier-to-interpret plots of λ could readily be constructed.
The critical parameter for these calculation is the distribution of serial intervals (SI), which is the time between the date of onset of symptoms for a case and the dates of onsets for any secondary cases that case gives rise to. Typically a discrete γ distribution for these serial intervals is assumed, parameterised by a mean and standard deviation, although more complex distributions are probably more realistic.
For the estimation of the force of infection λ, for the serial interval we’ll use a discrete γ distribution with a mean of 5.0 days and a standard deviation of 3.4.
After having an understanding on what is our business problem, initial datasets are loaded for data preparation and exploration prior model development.
Let’s begin by loading the libraries and functions that will be used for this analysis:
## plyr dplyr data.table
## TRUE TRUE TRUE
## tibble ggrepel deSolve
## TRUE TRUE TRUE
## stringr stringi forecast
## TRUE TRUE TRUE
## tidyverse broom gt
## TRUE TRUE TRUE
## caret zoo plotly
## TRUE TRUE TRUE
## DT tidyr rvest
## TRUE TRUE TRUE
## hrbrthemes inspectdf DataExplorer
## TRUE TRUE TRUE
## ggplot2 ggthemes gghighlight
## TRUE TRUE TRUE
## dlm PerformanceAnalytics projections
## TRUE TRUE TRUE
## earlyR ggdendro rgdal
## TRUE TRUE TRUE
## tsibble EpiEstim epitrix
## TRUE TRUE TRUE
## lubridate distcrete shadowtext
## TRUE TRUE TRUE
## nCov2019 magrittr incidence
## TRUE TRUE TRUE
Raw data for each Peru COVID-19 dataset are loaded.
The COVID-19 dataset (COVID_19_Peru_Raw) has 79 rows and 15 columns.
Please see below for COVID-19 Outbreak Worldwide Analytics updated on 27Jun20.
Here we observe Worldwide Outbreak trend after the 100th reported possitive case:
Below we find a geographic heatmap based on possitive confirmed cases up to date:
Heatmap of epidemic situation around the world in the last 7 days (log scale) for top 25 countries with higher number of possitive cases.
First, let’s look at the daily cumulative number of incident, lab-confirmed cases for Lima, for all the other provinces combined, and for all of Peru.
The initial increases for Lima and the balance of the other provinces, and for all of Peru look to be approximately sigmoidal, as is expected for epidemic spread. Let’s plot them on a logarithmic y axis. We would expect to see a linear increase on a log scale if the epidemic curve is indeed exponential.
One could convince oneself that log-linearity is present.
Let’s also look at the daily incremental incidence. This is more informative, and is known in outbreak epidemiological parlance as the epidemic curve. It is traditionally visualised as a bar chart, which emphasises missing data more than a line chart, even one with points as well. We’ll adhere to epidemiological tradition.
At the time of writing (28Jun20), it looks like incidence has yet to plateaued in Peru.
Now let’s look the daily (incremental) number of deaths in lab-confirmed cases for all provinces in Peru combined.
Let’s also look at the daily incremental deaths in lab-confirmed cases.
Here I will use Google’s mobility data to understand how social distancing and quarentine measures have been working in Peru.
Here I will analyze Non-pharmaceutical Interventions in Peru related to COVID-19 from the Oxford Covid-19 Government Response Tracker.
On 4 February 2020, data science blogger Learning Machines posted this analysis of the COVID-19 outbreak, in which he fitted the classic SIR (Susceptible-Infectious-Recovered) model to the incidence data for all of China. I’ll use his explanation of how to fit this model using R as based for this analysis.
The basic idea behind the SIR model of communicable disease outbreaks is that there are three groups (also called compartments) of people: those who are healthy but susceptible to the disease S, the infectious (and thus, infected) I and people who have recovered R:
Source: wikipedia
To model the dynamics of the outbreak we need three differential equations, to describe the rates of change in each group, parameterised by β which controls the transition between S and I and γ which controls the transition between I and R:
$$\frac{dS}{dt} = - \frac{\beta I S}{N}$$
$$\frac{dI}{dt} = \frac{\beta I S}{N}- \gamma I$$
$$\frac{dR}{dt} = \gamma I$$
The first step is to express these differential equations as an R function, which is easier than one might think – the expressions in the code are just direct translations of the differential equations, with respect to time t of course.
SIR <- function(time, state, parameters) {
par <- as.list(c(state, parameters))
with(par, {
dS <- -beta * I * S / N
dI <- beta * I * S / N - gamma * I
dR <- gamma * I
list(c(dS, dI, dR))
})
}To fit the model to the data we need two things: a solver for these differential equations and an optimiser to find the optimal values for our two unknown parameters, β and γ. The function ode() (for ordinary differential equations) from the deSolve package for R makes solving the system of equations easy, and to find the optimal values for the parameters we wish to estimate, we can just use the optim function built into base R. Specifically what we need to do is minimise the sum of the squared differences between I(t), which is the number of people in the infectious compartment I at time t, and the corresponding number of cases as predicted by our model Î(t). This quantity is known as the residual sum of squares (RSS)1.
RSS(β, γ) = ∑t(I(t)−Î(t))2
I now proceed to fit a model to the incidence data for all of Peru. We need a value N for the initial uninfected population. Once again we’ll scrape population data from a suitable wikipedia page.
The approximate population of Peru in 2017 was 31,237,385 people, according to this wikipedia page.
Next, we need to create a vector with the daily cumulative incidence for Peru, from 6th March when our daily incidence data starts, through to current date. We’ll then compare the predicted incidence from the SIR model fitted to these data with the actual incidence since 6th March. We also need to initialise the values for S, I and R.
# put the daily cumulative incidence numbers for Peru from 7th Mar to now into a
# vector called Infected
sir_start_date <- "2020-03-07"
# Infected <- Peru_Cumulative_Incidence %>% filter(province ==
# 'Peru_cases_increase', Date >= ymd('2020-03-07'), Date <= ymd('2020-04-07'))
# %>% pull(cumulative_incident_cases)
Infected <- COVID_19_Peru_Raw %>% filter(Date <= ymd("2020-04-11")) %>% select(Peru_molecular_total) %>%
pull(Peru_molecular_total)
# Create an incrementing Day vector the same length as our cases vector
Day <- 1:(length(Infected))
# now specify initial values for S, I and R
init <- c(S = N - Infected[1], I = Infected[1], R = 0)Then we need to define a function to calculate the RSS, given a set of values for β and γ.
# define a function to calculate the residual sum of squares (RSS), passing in
# parameters beta and gamma that are to be optimised for the best fit to the
# incidence data
RSS <- function(parameters) {
names(parameters) <- c("beta", "gamma")
out <- ode(y = init, times = Day, func = SIR, parms = parameters)
fit <- out[, 3]
sum((Infected - fit)^2)
}Finally, we can fit the SIR model to our data by finding the values for β and γ that minimise the residual sum of squares between the observed cumulative incidence and the predicted cumulative incidence. We also need to check that our model has converged, as indicated by the message shown below:
# now find the values of beta and gamma that give the smallest RSS, which
# represents the best fit to the data. Start with values of 0.2 for each, and
# constrain them to the interval 0 to 1.0
# get a good starting condition
mod <- nls(Infected ~ a * exp(b * Day), start = list(a = Infected[1], b = log(Infected[2]/Infected[1])))
lower = c(0, 0)
upper = c(1, 1)
optimsstart <- c(1, 2) * coef(mod)[2]
set.seed(12)
Opt <- optim(optimsstart, RSS, method = "L-BFGS-B", lower = lower, upper = upper,
hessian = TRUE)
# check for convergence
Opt$message## [1] "ERROR: ABNORMAL_TERMINATION_IN_LNSRCH"
## [,1] [,2]
## [1,] -2.459711 -2.459480
## [2,] -2.459480 -2.459248
Convergence is confirmed. Now we can examine the fitted values for β and γ.
## beta gamma
## 0.3994840 0.2024697
Those values don’t mean a lot, per se, but let’s use them to get the fitted number of people infected to our SIR model for the dates up to 07Apr20, which is the latest day that Peru published molecular test results without serological tests, and compare those fitted values with the observed data.
That looks like a reasonably good fit to the observed cumulative incidence data, so we can now use our fitted model to calculate the basic reproduction number R0 which gives the average number of susceptible people who are infected by each infectious person:
$$R_{0} = \frac{\beta}{\gamma}$$
That’s very easy to calculate, and we get:
## R0
## 1.973056
An R0 > 1.5 is consistent the values calculated by others for COVID-19, and is also consistent with the R0 for SARS and MERS, which are similar diseases also cause by coronavirus.
An obvious next step is to use our fitted SIR model to make predictions about the future course of the outbreak. However, caution is required, because the SIR model assumes a fixed reproduction number, but if public health interventions have been implemented, such as quarantining of cases, contact tracing and isolation of those contacts, and general restrictions on social mixing, such as closing the country, then the effective reproduction number Re will be dynamic and should fall as those interventions are progressively implemented, to values considerably less than the basic reproduction number R0, which reflects the behaviour of the virus at the beginning of an epidemic before any response has been implemented.
So let’s use our SIR model, fitted to the first 79 days of data, to extrapolate out to the current date, and compare that against the observed values:
We can see that the actual incidence similar than that predicted by our model. The reason is that, even with public health interventions implemented by the Peruvian authorities, the Re of the COVID-19 in Peru hasn’t been reduced by much. This is something that will need to be addressed ASAP.
When the Re falls below 1.0, the peak of the epidemic will have been reached and the outbreak will eventually die out.
It is instructive to use our model fitted with up to date lab-confirmed cases data, to see what would happen if the outbreak were left to run its course, without public health interventions.
It is easier to see what is going on if we use a log scale:
I’s clear that, if outbreak follows model above, it would ended up quite bad due that Peruvian Health System is not design to support this pandemic.
Additionally, we observe that model above, which is based on molecular test results only, differs from official peruvian numbers mainly due that official numbers are a mix of serological and molecular test results which may lower accuracy on official numbers.
At this point, it is worth remarking on the importance of decisive public health intervention to limit the spread of such epidemics. Without such interventions, tens of millions of people could be infected, as our model predicts, and even with only a one or two per cent mortality rate, hundreds of thousands of people would die.
As noted above, the initial exponential phase of an outbreak, when shown in a semi-log plot (the y-axis with a logarithmic transform), looks somehow linear. This suggests that we may be able to model epidemic growth and decay using a simple log-linear model:
log(y) = rt + b
where y is the incidence, r is the growth rate, t is the number of days since a specific point in time (typically the start of the outbreak), and b is the intercept. Separate models are fitted to the growth and the decay parts of the epidemic (incidence data) curve.
## <incidence object>
## [119959 cases from days 2020-03-07 to 2020-05-24]
##
## $counts: matrix with 79 rows and 1 columns
## $n: 119959 cases in total
## $dates: 79 dates marking the left-side of bins
## $interval: 1 day
## $timespan: 79 days
## $cumulative: FALSE
I proceed to find the peak of confirmed possitive cases based on current available data.
Based on analysis above, we observe that Peru’s last peak for COVID-19 confirmed cases was on 2020-05-21.
Now I proceed to fit a log-linear model for the growth phase before the peak. I can plot the fitted values from our model (with confidence limits) on top of the actual observed possitive incidence data for Peru.
And here I will evaluate force-of-infection λ plot
| Model Fitted Statistics | |||
|---|---|---|---|
| Last update on: 2020-06-28 | |||
| Dates | R2 | Adj. R2 | Deviance |
| all | 0.86 | 0.86 | 58.43 |
This growth rates is equivalent to a doubling time of 7.5 days (95% CI 6.8 - 8.2 days).
The doubling time estimate is quite ilustrative on informing current public health intervention policy in Peru.
Assessment: Outbreak in Peru is not in control yet and may still growth if public health policies are not followed.
Based on log-linear model of the epidemic trajectory I’m able to estimate the reproduction number in the growth phase of the epidemic. I need to provide a distribution for the serial interval time, which is the time between the onset of a primary case and the time of onset in its secondary cases. I’ll use a mean of 7.5 days and a standard deviation of 3.4 days to parameterise a discrete gamma distribution for the serial interval based on analysis of just 5 primary cases amongst the first 450 cases in Wuhan, published by Li et al.. Here is a histogram of our calculated distribution of possible values for R0 for the growth phase, based on the log-linear model we fitted to the Peru incidence data:
mu <- 7.5 # days
sigma <- 3.4 # days
param <- gamma_mucv2shapescale(mu, sigma/mu)
w <- distcrete("gamma", interval = 1, shape = param$shape, scale = param$scale, w = 0)
growth_R0 <- lm2R0_sample(Peru_incidence_fit$model, w)
hist(growth_R0, col = "darkblue", border = "white", main = "R0 distribution in Peru")## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.741 1.791 1.832 1.829 1.862 1.919
Note that the central estimates for R0 are considerably higher than those we calculated with a SIR model fitted to the same Peru data, and are also higher than R0 estimates published elsewhere, but they are consistent with estimates of the instantaneous effective reproduction number Re which are calculated in the next section2.
It would be useful to estimate the current effective reproduction number Re on a day-by-day basis so as to track the effectiveness of public health interventions in Peru, and possibly predict at the earliest opportunity when an outbreak will turn the corner.
There are several available methods for estimating Re – . Here I will focus on one method, developed in 2013 by Anne Cori and colleagues at Imperial College, London, which permits estimation of the instantaneous effective reproduction number, which is exactly want we want in order to track the effectiveness of containment efforts. Full details are available in the original paper on the method, with extensions described in a later paper by Thompson et al..
Once again, I’ll start with the counts of lab-confirmed posstive cases in Peru from 06th March 2020 onwards.
A critical parameter for the Cori model is the serial interval (SI). The SI is the time between onset of symptoms of each case of the disease in question, and the onset of symptoms in any secondary cases that result from transmission from the primary cases. In other words, it is the time between cases in the (branching) chain of transmission of the disease. A moment’s reflection reveals that the SI is, in fact, a statistical distribution of serial interval times, rather than a fixed value. That distribution can be simulated, typically using a discrete gamma distribution with a given mean and standard deviation.
There are different models that allows uncertainty to be incorporated into the parameterisation of this distribution, and it even allows the SI distribution to be estimated empirically, using Bayesian methods, from individual line-listings of cases. We’ll examine all of those capabilities.
I’m currently using one published estimate of the serial interval distribution, derived from analysis of just 5 primary cases amongst the first 450 cases in Wuhan, published by Li et al.. They estimate the serial interval distribution to have a mean of 7.5 days with a standard deviation of 3.4 days. This is almost identical to the serial interval parameters for the MERS virus, which were a mean of 7.6 and a SD of 3.4. This similarity is not surprising given that both pathogens are coronavirii. Let’s use that to estimate the instantaneous Re for Peru.
Our first observation would be that the slope of the effective reproduction number curve have yet to show a prominent downward shift, which strongly suggests that containment efforts are yet to succeed in reducing transmission of the disease in Peru.
Our second observation would be that the 7-day sliding window estimates of instantaneous Re are very high, approaching 15 at the peak. This seems unlikely. It is possible that the Cori model is flawed but has been shown to accurately estimate Re using a wide range of historical outbreak data. There are, however, a number of possible alternative explanations.
To cover the possibility of some cases transmitting the disease very soon after infection, possibly before the onset of symptoms (so-called super-spreaders), and some cases being sub-clinical, and thus undetected, spreading the disease as well, while other cases have a serial interval more consistent with that of MERS or SARS, with a mean around 8 days, I proceed to incorporate this uncertainty around the serial interval distribution by allowing specification of a distribution of distributions of serial intervals. So let’s retain the mean SI estimated by Li et al. of 7.5 days, with an SD of 3.4, but let’s also allow that mean SI to vary between 2.3 and 8.4 using a truncated normal distribution with an SD of 2.0. We’ll also allow the SD or the SI to vary between 0.5 and 4.0.
Recalculating with those SI distribution parameters and meta-parameters, we get these results:
Analysis above looks more reasonable, and clearly the Re is not falling just yet in Peru.
Let’s use data above to re-estimate Re. Bayesian methods are used, and the trace output below is from the MCMC (Markov-chain Monte Carlo) resampling methods used.
That is remarkably similar to our estimates based on an “uncertain” meta-distribution for the serial interval, above, which is encouraging.
Nonetheless some of the Re values are still rather high. Remember, these are instantaneous reproductive numbers, averaged over a sliding 7-day window. Let’s change that to a shorter window to see what the day-to-day changes in Re are, rather than based on an aggregation of the preceding 7-days. I’ll plot the results using a logarithmic scale with a red reference line at 1.0, representing the reproduction number below which the outbreak will start to die out.
So, it looks like the outbreak has yet to been brought under control in Peru, at least based on the published lab-confirmed possitive case numbers and the available serial interval data or estimates of its distribution.
It is clear that control measures have yet to work by looking at the daily incremental incidence numbers for Peru:
Here I will use a different approach such as linear and GLM regressions to fit COVID-19 outbreak in Peru:
## PERU -- 275989
## =============================== running models...===============================
## Linear Regression (lm):
##
## Call:
## lm(formula = y.var ~ x.var)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60645 -41420 -6584 38668 91693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -67523.48 7230.56 -9.339 <2e-16 ***
## x.var 1593.79 78.89 20.203 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45230 on 156 degrees of freedom
## Multiple R-squared: 0.7235, Adjusted R-squared: 0.7217
## F-statistic: 408.2 on 1 and 156 DF, p-value: < 2.2e-16
##
## --------------------------------------------------------------------------------
## Linear Regression (lm):
##
## Call:
## lm(formula = y.var ~ x.var)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0975 -1.1730 0.2717 1.2182 2.0078
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.509301 0.227997 -6.62 5.48e-10 ***
## x.var 0.104700 0.002488 42.09 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.426 on 156 degrees of freedom
## Multiple R-squared: 0.9191, Adjusted R-squared: 0.9185
## F-statistic: 1771 on 1 and 156 DF, p-value: < 2.2e-16
##
## --------------------------------------------------------------------------------
## GLM using Family [1] "poisson" :
##
## Call:
## glm(formula = y.var ~ x.var, family = family)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -201.52 -78.80 -41.87 37.44 111.56
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.0627897 0.0019922 3043 <2e-16 ***
## x.var 0.0432018 0.0000145 2979 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18288467 on 157 degrees of freedom
## Residual deviance: 911708 on 156 degrees of freedom
## AIC: 912996
##
## Number of Fisher Scoring iterations: 5
##
## --------------------------------------------------------------------------------
Here we can see the growth rate in Peru:
Finally, here we can observe daily trends based on current official data
## Processing Peru ...
After creating projections based on a simple SIR model, it is possible to use a more complex model to generate a sophisticated projection as well.
Here below I created a projection based up to 19 May 2020 data. Projection extends up to 14 days from last observed possitive case.
Created by Alexis Roldan